Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datasets): add SparkStreamingDataSet #198

Merged
merged 100 commits into from
May 31, 2023

Conversation

tingtingQB
Copy link
Contributor

Description

This PR is a continue for the closed PR #168
to add a spark_stream_dataset to the Kedro dataset plugin

Development notes

1 spark dataset and unit tests

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

astrojuanlu and others added 30 commits May 1, 2023 19:13
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
…org#161)

* Include missing requirements files in sdist

Fix kedro-orggh-86.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Migrate most project metadata to `pyproject.toml`

See kedro-org/kedro#2334.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Move requirements to `pyproject.toml`

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

---------

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Co-authored-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
* Upgrade Polars

Signed-off-by: Juan Luis Cano Rodríguez <hello@juanlu.space>

* Update Polars to 0.17.x

---------

Signed-off-by: Juan Luis Cano Rodríguez <hello@juanlu.space>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
)

Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
* Migrate kedro-airflow to static metadata

See kedro-org/kedro#2334.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Add explicit PEP 518 build requirements for kedro-datasets

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Typos

Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Remove dangling reference to requirements.txt

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Add release notes

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

---------

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
* Migrate kedro-telemetry to static metadata

See kedro-org/kedro#2334.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Add release notes

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

---------

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
* Add unit test + lint test on GA

* trigger GA - will revert

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Fix lint

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Add end to end tests

* Add cache key

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Add cache action

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Rename workflow files

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Lint + add comment + default bash

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Add windows test

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Update workflow name + revert changes to READMEs

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Add kedro-telemetry/RELEASE.md to trufflehog ignore

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Add pytables to test_requirements remove from workflow

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Revert "Add pytables to test_requirements remove from workflow"

This reverts commit 8203daa.

* Separate pip freeze step

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

---------

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
* Migrate kedro-docker to static metadata

See kedro-org/kedro#2334.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Address packaging warning

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Fix tests

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Actually install current plugin with dependencies

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Add release notes

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

---------

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Currently opening gitpod will installed a Python 3.11 which breaks everything because we don't support it set. This PR introduce a simple .gitpod.yml to get it started.

Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
* Update APIDataSet

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Sync ParquetDataSet

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Sync Test

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Linting

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Revert Unnecessary ParquetDataSet Changes

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Sync release notes

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

---------

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
…ream-datasets

# Conflicts:
#	.github/workflows/check-plugin.yml
#	kedro-datasets/tests/api/test_api_dataset.py
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Co-authored-by: Nok Lam Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Tingting_Wan <tingting_wan@mckinsey.com>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Copy link
Member

@deepyaman deepyaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, minor comments, and one design question. I haven't read the docs, but I assume Jo/others sufficiently reviewed them. :) I can take a look later today on the train perhaps, but no need to wait.

Comment on lines +89 to +91
if self._schema is not None:
if isinstance(self._schema, dict):
self._schema = SparkDataSet._load_schema_from_file(self._schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@noklam is it OK from a design perspective that SparkStreamingDataSet uses a private method of SparkDataSet? Feels a bit off to me, but perhaps no clear issues if their requirements are the same.

# Handle schema load argument
self._schema = self._load_args.pop("schema", None)
if self._schema is not None:
if isinstance(self._schema, dict):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: wonder why this is a nested if statement rather than using and? But it's the same on SparkDataSet, so I guess it's consistent. 🤷 Quite possible I did it when adding schema handling.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed these comment - maybe it should just inherit the SparkDataSet class? #135 I think in general we need to look at all SparkDataSet, many of it is weird but it's quite tricky to remove the code.

the path handling is particular confusing because it's unique for Spark. @deepyaman

Comment on lines +89 to +91
if self._schema is not None:
if isinstance(self._schema, dict):
self._schema = SparkDataSet._load_schema_from_file(self._schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with requiring schema; schema concept is more critical in streaming anyway.

kedro-datasets/setup.py Outdated Show resolved Hide resolved
kedro-datasets/tests/spark/test_spark_streaming_dataset.py Outdated Show resolved Hide resolved
@deepyaman deepyaman changed the title feat(datasets): Add stream datasets and unit tests feat(datasets): add SparkStreamingDataSet May 30, 2023
noklam and others added 18 commits May 30, 2023 09:51
* fixkedro- docker e2e test

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* fix: add timeout to request to satisfy bandit lint

---------

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
* ci: install the plugin alongside test requirements

* ci: install the plugin alongside test requirements

* Update kedro-airflow.yml

* Update kedro-datasets.yml

* Update kedro-docker.yml

* Update kedro-telemetry.yml

* Update kedro-airflow.yml

* Update kedro-datasets.yml

* Update kedro-airflow.yml

* Update kedro-docker.yml

* Update kedro-telemetry.yml

* ci(telemetry): update isort config to correct sort

* Don't use profile ¯\_(ツ)_/¯

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): remove empty `tool.black` section

* chore(docker): remove empty `tool.black` section

---------

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
…ro-org#203)

* Create check-release.yml

* change from test pypi to pypi

* split into jobs and move version logic into script

* update github actions output

* lint

* changes based on review

* changes based on review

* fix script to not append continuously

* change pypi api token logic

Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
* Less strict pin on Kedro for datasets

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
* ci: don't run checks on both `push`/`pull_request`

* ci: don't run checks on both `push`/`pull_request`

* ci: don't run checks on both `push`/`pull_request`

* ci: don't run checks on both `push`/`pull_request`

Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
…tests checked. (kedro-org#215)

* Create merge-gatekeeper.yml

* Update .github/workflows/merge-gatekeeper.yml

---------

Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
* remove circleci setup files and utils

* remove circleci configs in kedro-telemetry

* remove redundant .github in kedro-telemetry

* Delete continue_config.yml

* Update check-release.yml

* lint

* increase timeout to 40 mins for docker e2e tests

Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
* [FEAT] add save method to APIDataset

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [ENH] create save_args parameter for api_dataset

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [ENH] add tests for socket + http errors

Signed-off-by: <jmcdonnell@fieldbox.ai>
Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [ENH] check save data is json

Signed-off-by: <jmcdonnell@fieldbox.ai>
Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [FIX] clean code

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [ENH] handle different data types

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [FIX] test coverage for exceptions

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [ENH] add examples in APIDataSet docstring

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* sync APIDataSet  from kedro's `develop` (kedro-org#184)

* Update APIDataSet

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Sync ParquetDataSet

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Sync Test

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Linting

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Revert Unnecessary ParquetDataSet Changes

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Sync release notes

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

---------

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [FIX] remove support for delete method

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [FIX] lint files

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [FIX] fix conflicts

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [FIX] remove fail save test

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [ENH] review suggestions

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [ENH] fix tests

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

* [FIX] reorder arguments

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>

---------

Signed-off-by: jmcdonnell <jmcdonnell@fieldbox.ai>
Signed-off-by: <jmcdonnell@fieldbox.ai>
Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Co-authored-by: jmcdonnell <jmcdonnell@fieldbox.ai>
Co-authored-by: Nok Lam Chan <mediumnok@gmail.com>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
…g#212)

* ci: Automatically extract release notes

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* fix lint

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Raise exceptions

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Lint

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

* Lint

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>

---------

Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
* Add metadata attribute to all datasets

Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
…icks (kedro-org#206)

* committing first version of UnityTableCatalog with unit tests. This datasets allows users to interface with Unity catalog tables in Databricks to both read and write.

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* renaming dataset

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* adding mlflow connectors

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* fixing mlflow imports

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* cleaned up mlflow for initial release

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* cleaned up mlflow references from setup.py for initial release

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* fixed deps in setup.py

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* adding comments before intiial PR

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* moved validation to dataclass

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* bug fix in type of partition column and cleanup

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* updated docstring for ManagedTableDataSet

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* added backticks to catalog

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* fixing regex to allow hyphens

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/test_requirements.txt

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* adding backticks to catalog

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Require pandas < 2.0 for compatibility with spark < 3.4

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Replace use of walrus operator

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add test coverage for validation methods

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove unused versioning functions

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix exception catching for invalid schema, add test for invalid schema

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add pylint ignore

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add tests/databricks to ignore for no-spark tests

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Nok Lam Chan <mediumnok@gmail.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Nok Lam Chan <mediumnok@gmail.com>

* Remove spurious mlflow test dependency

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add explicit check for database existence

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove character limit for table names

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Refactor validation steps in ManagedTable

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove spurious checks for table and schema name existence

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

---------

Signed-off-by: Danny Farah <danny_farah@mckinsey.com>
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Co-authored-by: Danny Farah <danny.farah@quantumblack.com>
Co-authored-by: Danny Farah <danny_farah@mckinsey.com>
Co-authored-by: Nok Lam Chan <mediumnok@gmail.com>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
* Update APIDataset docs and refactor

* Acknowledge community contributor

* Fix more broken doc

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>

* Lint

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Fix release notes of upcoming kedro-datasets

---------

Signed-off-by: Nok Chan <nok.lam.chan@quantumblack.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Co-authored-by: Jannic <37243923+jmholzer@users.noreply.github.com>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
* Modify release version and RELEASE.md

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add proper name for ManagedTableDataSet

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update kedro-datasets/RELEASE.md

Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Revert lost semicolon for release 1.2.0

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

---------

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
* Fix APIDataSet docstring

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Add release notes

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Separate [docs] extras from [all] in kedro-datasets

Fix kedro-orggh-143.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

---------

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
Co-authored-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
@noklam noklam mentioned this pull request May 30, 2023
5 tasks
@noklam noklam self-requested a review May 30, 2023 16:22
Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is huge 🔥 , thank you both @tingtingQB and @kuriantom369. I'm gonna release it in 1.3.1 if nothing fails in CI.

Signed-off-by: Tom Kurian <tom_kurian@mckinsey.com>
@noklam noklam merged commit 2acb007 into kedro-org:main May 31, 2023
@@ -0,0 +1,44 @@
# Spark Streaming
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like how concise is this file here!

`YAML API <https://kedro.readthedocs.io/en/stable/data/\
data_catalog.html#use-the-data-catalog-with-the-yaml-api>`_:
.. code-block:: yaml
raw.new_inventory:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raw.new_inventory:
raw.new_inventory:

data_catalog.html#use-the-data-catalog-with-the-yaml-api>`_:
.. code-block:: yaml
raw.new_inventory:
type: streaming.extras.datasets.spark_streaming_dataset.SparkStreamingDataSet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
type: streaming.extras.datasets.spark_streaming_dataset.SparkStreamingDataSet
type: spark.SparkStreamingDataSet

@astrojuanlu
Copy link
Member

Also, the docstring has some problems:

sphinx.errors.SphinxWarning: /home/docs/checkouts/readthedocs.org/user_builds/kedro/envs/2485/lib/python3.8/site-packages/kedro_datasets/spark/spark_streaming_dataset.py:docstring of kedro_datasets.spark.spark_streaming_dataset.SparkStreamingDataSet:5:Unexpected indentation.

https://readthedocs.org/projects/kedro/builds/20874133/

Will open a PR to fix this along with @idanov comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Issue/PR opened by the open-source community
Projects
None yet
Development

Successfully merging this pull request may close these issues.